gh-149584: Fix excessive overhead in the Tachyon profiler regarding the cache behavior #149649
pablogsal wants to merge 9 commits into
Conversation
Use exact remote reads for interpreter state, thread state, and interpreter frame structs instead of pulling full remote pages into the profiler page cache. This matches the core change from python#149585.
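A rough sketch of the idea (the real change is in C; `RemoteReader` and its methods are invented here for illustration): reading exactly the bytes a fixed-size struct needs avoids both the 4 KiB copy and the cache bookkeeping that a full-page read pays for.

```python
# Hypothetical model: the old page-cached read vs. the new exact read.
PAGE_SIZE = 4096

class RemoteReader:
    def __init__(self, memory: bytes):
        self.memory = memory          # stands in for the target process's memory
        self.bytes_fetched = 0        # bytes actually copied from the "remote" side
        self.page_cache = {}

    def read_page_cached(self, addr: int, size: int) -> bytes:
        """Old behavior: fault the whole page containing addr into the cache."""
        page = addr // PAGE_SIZE
        if page not in self.page_cache:
            start = page * PAGE_SIZE
            self.page_cache[page] = self.memory[start:start + PAGE_SIZE]
            self.bytes_fetched += PAGE_SIZE
        off = addr % PAGE_SIZE
        return self.page_cache[page][off:off + size]

    def read_exact(self, addr: int, size: int) -> bytes:
        """New behavior: copy exactly `size` bytes, bypassing the page cache."""
        self.bytes_fetched += size
        return self.memory[addr:addr + size]

mem = bytes(range(256)) * 64          # 16 KiB of fake remote memory
r = RemoteReader(mem)
assert r.read_page_cached(0x100, 64) == r.read_exact(0x100, 64)
# Same 64 bytes either way, but the exact read moved 64 bytes
# where the page-cached read moved 4096.
```

For small structs read once per sample, the page cache never amortizes, so the exact read wins on both copied bytes and cache-maintenance work.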
The profiler clears the page cache between samples, so live entries are always packed at the front. Track the live count and only clear/search that prefix instead of scanning all 1024 slots on the hot path.
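The live-prefix trick can be sketched like this (the `PageCache` class and entry layout are invented for illustration; only the 1024-slot count comes from the text above). Because entries are always packed at the front, both lookup and clear stop at `used` instead of walking the full array.

```python
SLOTS = 1024

class PageCache:
    def __init__(self):
        self.pages = [None] * SLOTS   # (page_number, data) entries, packed at the front
        self.used = 0                 # count of live entries

    def lookup(self, page_number):
        # Only the first `used` slots can be live, so stop there instead of
        # scanning all 1024 slots on the hot path.
        for i in range(self.used):
            if self.pages[i][0] == page_number:
                return self.pages[i][1]
        return None

    def insert(self, page_number, data):
        if self.used < SLOTS:
            self.pages[self.used] = (page_number, data)
            self.used += 1

    def clear(self):
        # The tail past `used` is already None, so clearing the prefix
        # resets the whole cache between samples.
        for i in range(self.used):
            self.pages[i] = None
        self.used = 0
```

Since a typical sample touches far fewer than 1024 pages, the per-sample clear drops from O(1024) to O(live entries).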
@maurycy do you mind reviewing this PR?
Use the frame cache to predict the next thread state and top frame address, then batch interpreter/thread/frame reads with process_vm_readv when profiling a Linux target. Reuse prefetched frame buffers in the frame walker when the prediction is valid.
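A portable sketch of the predict-then-batch shape (the `Sampler` class, its method names, and the 64-byte interpreter-state read are all invented here; in the PR the batched read is a real `process_vm_readv()` call on Linux):

```python
INTERP_READ = 64                          # assumed interpreter-state slice size

class Sampler:
    def __init__(self, read_batch):
        self.read_batch = read_batch      # fn: list[(addr, size)] -> list[bytes]
        self.predicted = None             # (tstate_addr, frame_addr) from last sample
        self.prefetched = {}

    def sample(self, interp_addr, tstate_size, frame_size):
        spans = [(interp_addr, INTERP_READ)]
        if self.predicted is not None:
            tstate_addr, frame_addr = self.predicted
            spans += [(tstate_addr, tstate_size), (frame_addr, frame_size)]
        bufs = self.read_batch(spans)     # up to three structs, one batched call
        self.prefetched = dict(zip((a for a, _ in spans), bufs))
        return bufs[0]

    def frame_bytes(self, addr, size, read_one):
        # Frame walker: reuse the prefetched buffer when the prediction held,
        # fall back to a dedicated read otherwise.
        buf = self.prefetched.get(addr)
        return buf if buf is not None else read_one(addr, size)

calls = []
def fake_read_batch(spans):
    calls.append(len(spans))              # record spans per batched call
    return [bytes(size) for _, size in spans]

s = Sampler(fake_read_batch)
s.sample(0x1000, 512, 120)                # first sample: nothing to predict yet
s.predicted = (0x2000, 0x3000)            # frame cache guesses the next addresses
s.sample(0x1000, 512, 120)                # three structs, still one batched call
assert calls == [1, 3]
assert s.frame_bytes(0x3000, 120, lambda a, n: bytes(n)) == bytes(120)
```

When the prediction misses (the thread switched frames), the walker simply falls back to individual reads, so a wrong guess costs nothing beyond the prefetched bytes.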
Cache the last FrameInfo tuple per code object/instruction offset, reuse cached thread id objects, and append cached parent frames directly on full frame-cache hits. This cuts Python allocation churn in the steady-state profiler path.
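The memoization here can be sketched in a few lines (`frame_info` and `build_frame` are illustrative names, not the PR's actual C helpers): keying on the code object and instruction offset means a steady-state sample returns the exact same tuple object instead of allocating a fresh one.

```python
_frame_memo = {}

def frame_info(code_id, instr_offset, build):
    """Return the cached frame tuple for (code object, instruction offset),
    building it only the first time the pair is seen."""
    key = (code_id, instr_offset)
    cached = _frame_memo.get(key)
    if cached is None:
        cached = _frame_memo[key] = build(code_id, instr_offset)
    return cached

built = []
def build_frame(code_id, off):
    built.append((code_id, off))
    return ("app.py", 10 + off, "work")   # stand-in for a FrameInfo tuple

a = frame_info(7, 2, build_frame)
b = frame_info(7, 2, build_frame)
assert a is b            # identical object: no allocation on the hot path
assert built == [(7, 2)] # builder ran exactly once
```

Reusing cached thread id objects follows the same pattern, which is why the win shows up mainly when frame-cache hits dominate.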
@pablogsal Awesome job, I'm delighted :) Definitely will do! Perhaps it's worth an entry in the "Optimizations" section? One thing that I'm wondering immediately is extending
Added async benchmark modes so the benchmark harness can exercise
Since we are in beta1, I think we are good for now. Later makes sense.
Can you give it another go? |
@pablogsal Left two more comments:
(If they do not show, you have to check the resolved threads; the comments are follow-ups.)
Will take a look in 1h |
@pablogsal One last thing (in the current round): before-vs-after numbers for the newly added async benchmarks in the description, to see the gain and for future reference (for
I'm traveling, so I don't have access to the same box where I ran the original ones, so I will need to do these on a different one :)
Thank you. (TailScale FTW :)) |
Some ideas after the discussion in the issue with @maurycy. The profiler was spending too much time on repeated remote-memory bookkeeping, full remote page reads for small fixed-size structs, repeated remote writes of unchanged frame-cache state, and Python object allocation churn on steady-state frame-cache hits.
This PR improves the profiler by:

- using exact remote reads for interpreter state, thread state, and interpreter frame structs instead of pulling full remote pages into the page cache
- tracking the live page-cache entry count so only the used prefix is cleared and searched
- batching interpreter/thread/frame reads with `process_vm_readv()` on Linux
- reusing cached `FrameInfo` and thread id objects when frame-cache hits dominate

The `last_profiled_frame` remote-write suppression is already present in current `upstream/main`, so this branch keeps that baseline behavior and builds on top of it.

Benchmark
Benchmarked with:
For the per-commit measurements, I used the same benchmark workload in quiet mode with `cache_frames=True` and `all_threads=True`.

`upstream/main` baseline:

Final benchmark using the benchmark script:
Final benchmark output:
_remote_debugging: reading whole pages over and over #149584